Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added support for Databricks Unity Catalog #1331

Merged
merged 3 commits into from
May 6, 2023

Conversation

nohajc
Copy link
Contributor

@nohajc nohajc commented May 3, 2023

Description

I'm currently working with Azure Databricks which uses Unity Catalog for storing Delta Lake table paths and I thought it would be nice if delta-rs provided basic support for it.

There isn't any Databricks SDK library for Rust as far as I know, so this change introduces direct dependency to reqwest, reqwest-middleware and reqwest-retry crates when the unity feature flag is enabled.

Documentation

https://docs.databricks.com/api-explorer/workspace/tables/get

Notes

I manually tested against Azure Databricks workspace but the implementation should work with any cloud provider offering Databricks.

This is my first contribution, so I welcome any feedback. Thanks.

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate documentation Improvements or additions to documentation rust labels May 3, 2023
@github-actions
Copy link

github-actions bot commented May 3, 2023

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@nohajc nohajc changed the title Added support for Databricks Unity Catalog feat: added support for Databricks Unity Catalog May 3, 2023
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this great contribution @nohajc - left some initial comments, but will give this a more thorough review tomorrow.


impl UnityCatalog {
/// Creates a new UnityCatalog
pub fn new() -> Result<Self, DataCatalogError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For our clients etc, we usually adopt a builder pattern, that allows us to configure options on the client before building the client. We would then have some methos like UnitiCatalogClientBuilder::from_env() to support parsing the configuration from the environment.

That said, haven't worked with our catalogs for a long time, so not sure if we properly pass things through right now to make that work. IF not, this maybe something we want to address in a a follow-up.


impl From<reqwest::Error> for DataCatalogError {
fn from(value: reqwest::Error) -> Self {
reqwest_middleware::Error::Reqwest(value).into()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit convoluted. We are wrapping the actual error in another error to then convert it again. Not sure how many variants the middleware error has, but maybe its viable to unpack the middleware error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be only two variants, unpacking sounds like a good idea. I was just trying to solve a type checking error, didn't give it much thought...

Copy link
Contributor Author

@nohajc nohajc May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the middleware error is defined like this:

#[derive(Error, Debug)]
pub enum Error {
    /// There was an error running some middleware
    #[error("Middleware error: {0}")]
    Middleware(#[from] anyhow::Error),
    /// Error from the underlying reqwest client
    #[error("Request error: {0}")]
    Reqwest(#[from] reqwest::Error),
}

I could flatten the hierarchy by adding the same variants into GetTableError but both still represent RequestError so I don't like it much. Basically, the conversion was introduced only because of the reqwest::Client retry wrapper. Otherwise I would have just the one reqwest::Error type.

Copy link
Contributor Author

@nohajc nohajc May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could also convert reqwest::Error to anyhow::Error but then I'm losing type information... Not sure it's even good practice to return anyhow from library code. It makes sense for the middleware itself but mixing it with the underlying request error seems wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it would be nice to unify some of the DataCatalogError variants across implementations but I'd prefer to do it separately.


/// Possible errors from the unity-catalog/tables API call
#[derive(thiserror::Error, Debug)]
pub enum GetTableError {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using these error variants somewhere?

While still very much a long term things, we have been trying to reduce the number of error variants we expose to users in this crate and also make them more actionable. Maybe we can make some common errors explicit - like 403?

Copy link
Contributor Author

@nohajc nohajc May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it in a similar way to the existing AWS glue module but I understand it's not perfect.

Didn't want to start refactoring existing code either. At least not yet.

@roeap
Copy link
Collaborator

roeap commented May 4, 2023

Had a look through our current code and thought a bit about where we might want to go with our catalogues in general. I landed on taking a similar approach to what we did in the object_store crate. There we went through quite some effort to make everything SDK-free. Good news is the unitiy catalog implementation already follows that pattern to some extend :).

The reasoning is quite similar. Our needs towards what we want from the APIs is usually a very small subset of what the overall catalog exposes, so implementing a few rest calls should be quite maintainable. Thought about just exposing the client from object_store, but I guess this would violate our goal over there to keep the API contact with consumers minimal and not expose any internals.

So how to go from there? I would be happy to merge this PR rather soon, but maybe we can rename the feature flag to something like unitiy-experimental. So we can do a bit of refactoring afterwards. Essentially this might entail:

  • Porting over the http client from object store, and adopting it for the needs of the catalog APIs. (i.e. getting rid of rusoto dependency in glue client.
  • Establishing a builders / configuration system like object_store (which is in turn what we use here for configuring storage)
  • and some more stuff I likely missed :).

We could then also leverage the mock server implementation from over there to test the clients.

cc @wjones127 @rtyler @houqp @MrPowers

@nohajc
Copy link
Contributor Author

nohajc commented May 4, 2023

  • Porting over the http client from object store, and adopting it for the needs of the catalog APIs. (i.e. getting rid of rusoto dependency in glue client.

By porting over you mean duplicating the code or separating it into its own crate?

When I was going through the dependencies to find out what kind of http clients are used in the project, I saw that object_store already depends on reqwest. So, how does the client built on top of it compare to reqwest-middleware with reqwest-retry?

Anyway, I'd be happy to contribute to this refactoring effort if I find the time.

@roeap
Copy link
Collaborator

roeap commented May 4, 2023

By porting over you mean duplicating the code or separating it into its own crate?

For now I would say, duplicating and adjusting, a new crate somewhere between an http client and the sdks brings not much value to the community - I think.

So, how does the client built on top of it compare to reqwest-middleware with reqwest-retry?

They probably do many things quite similar, but since we already have a well tested implementation that does not require these dependencies, we may as well use them :). More importantly though, we implemented quite a few different authorization mechanisms for the various cloud providers that we can leverage to also integrate with other catalogs like Pruview in Azure, for glue to to remove rusoto, or whatever google has.

In my experience, providing credentials for the sometimes quite different and complex ways of acquiring credentials in the various environments a code can run in, is usually the hardest part of this. So the main aim is to leverage as much of this as possible.

@tustvold - I just discounted the idea, that we want to make the http client and some credentials from object_store public, but maybe you have some thoughts on that? Maybe we could also find synergies with datafusion when it comes to implementing catalog clients?

@nohajc
Copy link
Contributor Author

nohajc commented May 4, 2023

More importantly though, we implemented quite a few different authorization mechanisms for the various cloud providers that we can leverage to also integrate with other catalogs like Pruview in Azure, for glue to to remove rusoto, or whatever google has.

In my experience, providing credentials for the sometimes quite different and complex ways of acquiring credentials in the various environments a code can run in, is usually the hardest part of this. So the main aim is to leverage as much of this as possible.

This makes a lot of sense but I'd say complex auth flows are another reason for code reuse rather than duplication. Even if it didn't have value in the community when the SDKs already exist, it still has value as a common building block of the ecosystem you're building.

Just my two cents.

@nohajc
Copy link
Contributor Author

nohajc commented May 5, 2023

I renamed the feature flag as suggested.

Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nohajc, from my end we can merge this.

I should get around to add some tests for this soon, and hopefully migrate the glue catalog.

I'll leave this open for a little bit in case others want to chime in.

@dennyglee
Copy link
Collaborator

Thanks for this PR @nohajc - once this gets merged, please ping me via Delta Users Slack - dennyglee so we can send you some swag for your first Delta Lake contribution, eh?! Thanks!

@djouallah
Copy link

Does this support reading Databricks Delta table with merge on read, deletion vector etc

@dennyglee
Copy link
Collaborator

Does this support reading Databricks Delta table with merge on read, deletion vector etc

Hey @djouallah - good to hear from you! As noted in this PR, it's supporting basic unity catalog integration hence it's' not tackling MoR, deletion vectors. Why not create issues related to them as we have discussed deletion vector (and then afterwards MoR) support but I don't think we've had a chance to create an issue for this. And PRs are always welcome :)

@roeap roeap merged commit 0b7cc06 into delta-io:main May 6, 2023
@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Jun 7, 2023

@nohajc @roeap
I am not able to connect to the unity catalog with my code, how did you get it working?

import os
from deltalake import DataCatalog, DeltaTable

os.environ['DATABRICKS_WORKSPACE_URL'] = URL
os.environ['DATABRICKS_ACCESS_TOKEN'] = SECRET

catalog_name = 'private_catalog'
schema_name = 'private_schema'
table_name = 'private_table'
data_catalog = DataCatalog.UNITY
dt = DeltaTable.from_data_catalog(data_catalog=data_catalog, data_catalog_id=catalog_name, database_name=schema_name, table_name=table_name)

It results in this:

PyDeltaTableError: Failed to load checkpoint: Failed to read checkpoint content: Generic MicrosoftAzure error: Error authorizing request: Error performing token request: response error "request error", after 0 retries: error sending request for url (http://<ip.address>/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com%2F.default): error trying to connect: tcp connect error: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (os error 10060)

@nohajc
Copy link
Contributor Author

nohajc commented Jun 7, 2023

@nohajc @roeap I am not able to connect to the unity catalog with my code, how did you get it working?

I didn't have to do anything special. My access token just worked. This could be any number of things... maybe a firewall issue?

@roeap
Copy link
Collaborator

roeap commented Jun 7, 2023

From the error message it seems that communication with unity catalog worked, but the credentials to access the storage are not available.

We are working on improving those apis. For now you could likely use the environment to configure storage credentials.

At least thats my best guess based on the error message.

@nohajc
Copy link
Contributor Author

nohajc commented Jun 7, 2023

That's right, I forgot to mention I also had the azure storage account credentials in the environment. This is not explicitly said in the documentation but it is a requirement.

@ion-elgreco
Copy link
Collaborator

@nohajc @roeap I am not able to connect to the unity catalog with my code, how did you get it working?

I didn't have to do anything special. My access token just worked. This could be any number of things... maybe a firewall issue?

Could be, the databricks workspaces are managed by our platform teams.

That's right, I forgot to mention I also had the azure storage account credentials in the environment. This is not explicitly said in the documentation but it is a requirement.

Ah, that would make sense, I don't have this, nor would I get the account credentials from our platform team. I am only able to access the table with my personal credentials through the catalog.

Would it be possible to expose unity catalog tables without storage credentials to delta-rs?

@nohajc
Copy link
Contributor Author

nohajc commented Jun 8, 2023

Would it be possible to expose unity catalog tables without storage credentials to delta-rs?

Well, I think you have to authenticate against the storage one way or another. If this is Azure, it could be as simple as running az login in your shell session. Assuming your user has the correct permissions.

@ion-elgreco
Copy link
Collaborator

Would it be possible to expose unity catalog tables without storage credentials to delta-rs?

Well, I think you have to authenticate against the storage one way or another. If this is Azure, it could be as simple as running az login in your shell session. Assuming your user has the correct permissions.

Just to update this worked. Now I have two ways to read delta tables, directly on filepath or through UC schema's.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate documentation Improvements or additions to documentation rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants